DECTNet: Dual Encoder Network combined convolution and Transformer architecture for medical image segmentation

Automatic and accurate segmentation of medical images plays an essential role in disease diagnosis and treatment planning. Convolutional neural networks have achieved remarkable results in medical image segmentation in the past decade. Meanwhile, deep learning models based on the Transformer architecture have also succeeded tremendously in this domain. However, due to the ambiguity of medical image boundaries and the high complexity of physical organization structures, effective structure extraction and accurate segmentation remain open problems. In this paper, we propose a novel Dual Encoder Network named DECTNet to alleviate this problem. Specifically, DECTNet comprises four components: a convolution-based encoder, a Transformer-based encoder, a feature fusion decoder, and a deep supervision module. The convolutional structure encoder extracts fine spatial contextual details in images. Meanwhile, the Transformer structure encoder is designed using a hierarchical Swin Transformer architecture to model global contextual information. The novel feature fusion decoder integrates the multi-scale representations from the two encoders and selects features relevant to the segmentation task through a channel attention mechanism. Further, a deep supervision module is used to accelerate the convergence of the proposed method. Extensive experiments demonstrate that, compared with seven other models, the proposed method achieves state-of-the-art results on four segmentation tasks: skin lesion segmentation, polyp segmentation, Covid-19 lesion segmentation, and MRI cardiac segmentation.

Convolutional neural networks (CNNs) have achieved considerable success in medical image analysis in the past decade. They have achieved state-of-the-art performance in a large number of segmentation tasks. Compared with traditional methods, CNNs have superior representation capability and automatically learn task-relevant features in medical images. In particular, the most widely applied architecture is UNet [16], which is based on an encoder and a decoder. In UNet, the encoder analyzes the semantic image information and learns high-dimensional features, and the decoder performs image segmentation according to these representations. Since the encoder loses details of the original image when learning high-dimensional representations, skip connections are employed to supplement the detailed information in the decoder's features, and this technique significantly improves the image segmentation performance.
With the great success of the Transformer architecture in the field of NLP, several studies have been devoted to transferring the Transformer architecture to the computer vision domain. The Vision Transformer (ViT) model [17] is the first to demonstrate the efficacy of the Transformer architecture in computer vision. The advantage of the ViT model is that it can establish long-range dependencies between pixels in images, improving the performance in visual tasks. Swin Transformer [18] also demonstrates the efficacy of the Transformer architecture with hierarchical representations for vision tasks. However, unlike CNNs, Transformer architectures need more annotated samples to outperform convolutional models due to their lack of inductive bias. Convolution-based models can extract low-dimensional semantic information from images and establish fine spatial detail features, but they are not well suited to establishing global relationships due to the limits of the receptive field. In contrast, Transformer architectures compensate for this weakness of convolution-based models by modeling global information. For this reason, we combine the advantages of convolution and Transformer structures to design a novel medical image segmentation model.
Specifically, we propose a segmentation model with a dual encoder that combines the advantages of the convolution and Transformer architectures. The convolution encoder with channel and spatial attention can sufficiently extract the local context information. In contrast, the Swin Transformer architecture is utilized as the fundamental component of the Transformer encoder, which is based on the window and shift-window self-attention technique and is sensitive to global context information. It is shown in [19,20] that the visual features extracted by the Transformer architecture differ from those extracted by the convolution architecture. Guo et al. [21] show that a strong backbone is key to semantic segmentation. In addition, to improve the Swin Transformer architecture for dense prediction tasks, we design an STP block in the Swin Transformer encoder to enable the encoder to extract shallow semantic information in the initial stage.
Similar to UNet, DECTNet employs skip connections between the encoder and the decoder. However, because the stages of the convolutional encoder output features at different scales from the stages of the Transformer encoder, we design a feature fusion decoder stage. It fuses features of different scales from different stages and selects the representations beneficial to specific tasks through a dedicated feature selection module. We further design a deep supervision module to supervise the decoder stage outputs. It accelerates the convergence of the model and enables the model to obtain superior segmentation performance.
Like our proposed DECTNet, Li et al. [22] designed a segmentation model with a dual encoder based on convolution and Transformer architectures named CATS. However, CATS applies the vanilla ViT [17] architecture in the Transformer-based encoder, which increases the parameters and the computational complexity of the segmentation model. In addition, DECTNet contains abundant channel and spatial attention, feature selection, and fusion modules to extract suitable representations from redundant features for specific segmentation tasks, which is not available in the CATS method but is essential for dense prediction tasks.
In summary, the main contributions of this paper are reflected in the following three aspects:
• We propose a novel segmentation method called DECTNet, which has two encoders based on convolution and Swin Transformer architectures. Through dense connections and the CBAM module, the convolution-based encoder selects task-relevant local features. Meanwhile, by combining the STB module, the standalone and entire Swin Transformer encoder allows the efficient establishment of the global representation in images.
• To learn from the features of the different encoders, we design a practical feature fusion decoder, which integrates the representations from the convolution encoder and the Swin Transformer encoder. It selects the information in the features that is valuable for segmentation tasks.
• We perform extensive experiments on four medical image segmentation tasks. Experiments show that our model performs best on most metrics compared to other state-of-the-art models. Furthermore, the ablation experiments demonstrate the effect of each component of the proposed DECTNet.
The remainder of this paper is arranged as follows. Section Related works introduces the related works. Section Materials and methods describes the proposed DECTNet in detail. Section Experiment and Section Results present the experiment implementation and results. Section Discussion and Section Conclusion contain the discussion and conclusions.

Related works
According to their network structures, the existing medical image segmentation methods based on deep learning can be divided into two categories: 1) methods based on convolution structures; and 2) methods that combine convolution and the self-attention mechanism. Although some models adopt the self-attention mechanism as their only feature extraction technique, the number of these models is relatively small, so we classify them into the second category.

Methods based on convolution network
In recent years, numerous convolution-based segmentation models have made significant progress in various segmentation tasks. In particular, FCN [23] is one of the most notable approaches. It has an analysis path and a synthesis path consisting of cascaded convolution, pooling, and deconvolution layers. Compared to FCN, UNet has skip connections between the analysis and synthesis paths, which supplement the detailed information lacking in the features of the synthesis path. Because of its flexible structure, UNet is widely applied in medical image segmentation.
Compared to UNet, UNet++ [24] and CPFNet [25] apply additional skip connections between the encoder and decoder to aggregate features from different semantic scales. In addition, DeepLabv3+ [26] and CENet [27] employ multiple convolutional branches with various receptive fields to improve the multi-scale information capture capability of the model.
As channel and spatial attention are effective in visual tasks [28,29], several approaches combining convolution and attention mechanisms have emerged. For example, [30,31] use channel attention to guide shallow-level learning of global feature representation. In contrast, the attention UNet proposed in [32] combines spatial attention with the UNet structure for abdominal pancreas segmentation from CT images. Roy et al. [33] propose an scSE framework that combines spatial attention and channel attention, which has been proven effective on whole-brain and abdominal multi-organ segmentation tasks. In addition, Song et al. [34] and other works [35,36] also combine convolution and attention mechanisms to design models for specific medical image segmentation tasks.

Methods combining convolution with the self-attention mechanism
With the success of the Transformer architecture in natural language processing, its application to computer vision has become a popular research domain. ViT [17] is the first method to successfully employ the Transformer architecture to solve computer vision problems. It divides images into patches and establishes global context dependencies by computing the similarity between patches. The Swin Transformer [18] is an improved version of ViT. It introduces the window and shift-window self-attention module to reduce the computational complexity of the Transformer structure. Furthermore, unlike ViT, the Swin Transformer produces features at multiple scales, which is essential for applying the Transformer structure to dense prediction tasks such as object detection and image segmentation.
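To make the complexity reduction concrete, the sketch below evaluates the two cost expressions given in the Swin Transformer paper [18] for global self-attention and window self-attention. The token grid, channel width, and window size used here are illustrative assumptions (a 224 × 224 input with a hypothetical 4 × 4 patch size, 48 channels, and 7 × 7 windows), not values confirmed for any specific model discussed in this paper.

```python
def msa_cost(h: int, w: int, c: int) -> int:
    """Cost of global multi-head self-attention on an h x w token grid with c channels.

    Expression from the Swin Transformer paper: 4*h*w*C^2 + 2*(h*w)^2*C.
    """
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c


def window_msa_cost(h: int, w: int, c: int, m: int) -> int:
    """Cost of window self-attention with non-overlapping m x m windows.

    Expression from the Swin Transformer paper: 4*h*w*C^2 + 2*M^2*h*w*C.
    """
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c


# Illustrative values (assumptions): 56 x 56 tokens, 48 channels, 7 x 7 windows.
h = w = 56
c = 48
m = 7
global_cost = msa_cost(h, w, c)
window_cost = window_msa_cost(h, w, c, m)
print(f"global: {global_cost}, windowed: {window_cost}, "
      f"ratio: {global_cost / window_cost:.1f}")
```

The global term grows quadratically with the number of tokens, whereas the windowed term grows linearly, which is why the window-based design scales to the high-resolution feature maps needed for dense prediction.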
Recently, researchers have attempted to combine traditional convolution with the self-attention mechanism to design models with superior performance in medical image segmentation tasks. In particular, TransUNet [37] combines ViT and UNet: it employs an encoder-decoder structure whose convolution portion extracts local contextual information in high-resolution, low-dimension features, and whose Transformer portion establishes the global relationships of patches in low-resolution, high-dimension features. TransUNet establishes global dependencies between patches of the high-dimensional representation through self-attention but fails to establish long-range dependencies between the patches of low-dimensional semantic features. Cao et al. [38] proposed SwinUNet, which employs Swin Transformer blocks in the UNet backbone instead of convolution as the feature extraction technique. After pre-training, this method performs excellently on multi-organ and cardiac segmentation tasks.
In addition, Hung et al. [39] designed a cross-slice attention Transformer module. Combining it with convolutional networks and skip connections enables the accurate segmentation of prostate zones in MRI. Liu et al. [40] designed MCTHNet by integrating convolution and Transformer structures for multi-modal medical image segmentation with limited annotation, and their approach achieved the best semi-supervised results on several multi-modal datasets. Furthermore, TransFuse [41], Medical Transformer [42], TransBTS [43], FCT [44], and HiFormer [45] combine self-attention with convolutional networks to achieve excellent results in specific medical image segmentation tasks.
Although a growing body of research focuses on combining the self-attention mechanism and convolution to improve the performance of segmentation models, only a few studies have noted that the strong backbone of the Transformer architecture may have a positive impact on image segmentation. Therefore, we combine the full Transformer structure, rather than the simple self-attention technique, with the convolution architecture.

Convolution-based encoder
Each stage of the convolution-based encoder combines a dense connection block [46] and a CBAM block [28]. In a single stage, the features pass into the dense connection block for adequate information extraction and then through the CBAM module to further enhance the information that is valuable for the segmentation task. It should be noted that within a single stage, the scale and channels of the features are not changed.
Compared with the residual block, the dense connection block has more abundant residual connections, which fully utilize the features and reduce the impact of vanishing gradients. The features output by the dense connection block are fed to the CBAM module. The role of the CBAM module is to apply a spatial attention mask and a channel attention mask to the features. It makes the model pay more attention to the task-related information and ignore the task-independent information in the features. Compared to the SE module [29], the CBAM module adds spatial attention, enhancing the valid information in the features.
Because DECTNet has two encoders, the number of model parameters increases significantly. To reduce the model parameters, we use depthwise separable convolution [47] instead of traditional convolution in the C-encoder. In addition, we use a traditional convolution with a stride of 2 and a kernel size of 2 as the downsampling operation, which reduces the scale of the features while increasing their channels.
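The parameter saving from depthwise separable convolution, and the effect of the stride-2 downsampling, can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 3 × 3 kernel and the 32-channel width are assumptions chosen for the example, and the helper names are hypothetical.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


def conv_output_size(size: int, k: int = 2, stride: int = 2, padding: int = 0) -> int:
    """Spatial size after a convolution with the given kernel, stride, and padding."""
    return (size + 2 * padding - k) // stride + 1


# Assumed example: 32 -> 32 channels with a 3 x 3 kernel.
standard = conv_params(32, 32, 3)                  # 9216 weights
separable = depthwise_separable_params(32, 32, 3)  # 288 + 1024 = 1312 weights
print(standard, separable, round(standard / separable, 2))

# The kernel-2, stride-2 downsampling halves the spatial size: 224 -> 112.
print(conv_output_size(224))
```

For equal input and output widths, the saving approaches a factor of k² as the channel count grows, which explains why the substitution is attractive when two encoders are trained jointly.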

Swin Transformer-based encoder
The other encoder of DECTNet is built on the Swin Transformer architecture, detailed in Fig 3(a). Compared with the original ViT architecture, the Swin Transformer structure yields hierarchical features that are essential for image segmentation. Fig 3(a) also shows the scale of the features at each stage of the ST-Encoder. The structure of the Patch Merging blocks is identical to that of the Swin Transformer [18].
Compared with image recognition, semantic segmentation needs more semantic details and structured information, such as spatial context information. Therefore, embedding the image with a large stride, as in the traditional ViT method, discards semantic information present in the raw image. Replacing this step with small-stride convolution improves the performance of models on semantic segmentation tasks. We therefore design the STP block, as shown in Fig 3(b). It extracts low-dimensional features from the image using small-stride convolutions arranged in residual blocks. The purpose of the STP block is to adjust the resolution and channels of low-dimensional features to provide features rich in
The W-MSA and the SW-MSA structures are described in [18]. In the Swin Transformer Block, the procedure for processing the features can be formulated as follows:

$$
\begin{aligned}
\hat{z}^{l} &= \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1}, \\
z^{l} &= \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l}, \\
z^{l+1} &= \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1},
\end{aligned}
$$

where $\hat{z}^{l}$ and $z^{l}$ are the output features of the (S)W-MSA and the MLP module at layer $l$, and LN denotes layer normalization as in [18]. Because the W-MSA and the SW-MSA are complementary and are applied in alternation, an even number of Swin Transformer Blocks is required at each Swin Transformer encoder stage; the numbers of blocks per stage are [2,2,6,2] in the proposed DECTNet, as shown in Fig 3(a).
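The requirement of an even number of blocks per stage follows from the pairing of a W-MSA block with an SW-MSA block. The sketch below only enumerates which attention type each block would use for the stage depths [2,2,6,2]; the function name is hypothetical and the code does not implement the attention itself.

```python
def attention_schedule(depths):
    """Return, for every stage, the attention type used by each Swin Transformer block.

    Blocks alternate between regular-window (W-MSA) and shifted-window (SW-MSA)
    attention, so every stage must contain an even number of blocks.
    """
    schedule = []
    for stage_index, depth in enumerate(depths):
        if depth % 2 != 0:
            raise ValueError(f"stage {stage_index} has an odd number of blocks: {depth}")
        schedule.append(["W-MSA" if i % 2 == 0 else "SW-MSA" for i in range(depth)])
    return schedule


stages = attention_schedule([2, 2, 6, 2])
for index, blocks in enumerate(stages, start=1):
    print(f"stage {index}: {blocks}")
```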

Feature fusion decoder
Due to the dual encoder architecture of DECTNet, the decoder is required to receive features of different scales from the skip connections. Therefore, we design a novel Feature Fusion Decoder to suit the particular structure of the proposed method. As shown in Fig 1, similar to the two encoders in DECTNet, the proposed F-Decoder consists of several stages with the same structure. In addition to the stage that processes the lowest resolution features, each decoder stage
The purpose of the feature selection block is to reinforce the meaningful information in the features from the feature aggregation block through the attention mechanism. It applies the SE module [29] to add channel attention to the features, enhancing the features that benefit the task and suppressing the task-irrelevant features. It should be noted that the initial stage of the F-Decoder, that is, the intersection of the C-Encoder stage and the ST-Encoder stage, has only two inputs, which come from the C-Encoder stage and the ST-Encoder stage, since there are no features from a prior F-Decoder stage.
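The channel attention used for feature selection can be illustrated with a small, dependency-free sketch of the squeeze-and-excitation idea from [29]: average each channel to a scalar, pass the scalars through two fully connected layers, and rescale each channel by the resulting gate. The weights below are arbitrary illustrative values, the fully connected layers omit biases for brevity, and the function names are hypothetical.

```python
import math


def squeeze(features):
    """Global average pooling: reduce each channel (a 2D list) to one scalar."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]


def linear(x, weights):
    """Bias-free fully connected layer; weights has shape [out][in]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def se_attention(features, w_reduce, w_expand):
    """Rescale every channel by a gate in (0, 1) computed from all channels."""
    z = squeeze(features)
    hidden = [max(0.0, v) for v in linear(z, w_reduce)]                      # ReLU
    gates = [1.0 / (1.0 + math.exp(-v)) for v in linear(hidden, w_expand)]   # sigmoid
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, features)]
    return scaled, gates


# Two channels of size 2 x 2 with arbitrary example weights.
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
w_reduce = [[1.0, 1.0]]     # 2 channels -> 1 hidden unit
w_expand = [[1.0], [-1.0]]  # 1 hidden unit -> 2 channel gates
scaled, gates = se_attention(features, w_reduce, w_expand)
print(gates)
```

In this toy example the first channel receives a gate above 0.5 and the second a gate below 0.5, so the first is emphasized and the second is suppressed, which is the selection behaviour the decoder relies on.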

Deep supervision and loss function
It is demonstrated in [48,49] that the deep supervision technique can accelerate the convergence of a segmentation method and achieve better performance. For this reason, we introduce a deep supervision module to supervise the output features of the decoder stages. The detail of the DS-Module is shown in Fig 5.
The structure of the DS-Module is quite simple. For the output features of a particular decoder stage, the DS-Module consists of Convolution, Batch Normalization, and UpSample operations, where transposed convolution is adopted to implement UpSample. For features of different scales output from different stages, the DS-Module first uses 1 × 1 convolution and batch normalization to bring the number of channels to a common value. It then uses transposed convolution to upsample the features from the different stages to the label size, and the channels are adjusted to the number of classes. Thus, the confidence maps produced by the F-Decoder and the DS-Module have identical resolutions, and each confidence map generates a loss with the label during the training process. Therefore, the loss function of DECTNet is defined as the weighted sum of all losses between the prediction confidence maps and the corresponding labels:

$$
L = \sum_{k=1}^{K} \omega_k\, l_k,
$$

where $K$ represents the number of prediction confidence maps, $l_k$ represents the loss calculated from the $k$th prediction confidence map and the segmentation mask, and $\omega_k$ is the weight of the corresponding loss. In this paper, each $l_k$ combines Dice loss and cross-entropy loss through a weighting coefficient $\alpha$. In DECTNet, $K$ and $\alpha$ are set to 4 and 0.5, respectively. $l_4$ refers to the loss resulting from the final confidence maps of the model against the ground truth, and $l_1$-$l_3$ are the losses computed from the confidence maps output by the DS-Module. To ensure the dominance of $l_4$ in the overall loss, we set $\omega_4$ to 0.7, and $\omega_1$-$\omega_3$ are all set to 0.1.
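The weighted deep supervision loss can be sketched for the binary case in plain Python. Two points are assumptions rather than facts from the paper: the exact form of the per-map loss, taken here as α times the Dice loss plus (1 − α) times the cross-entropy loss, and the smoothing constants. With α = 0.5 the two terms are weighted equally either way. All function names are hypothetical.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss on flattened foreground probabilities and binary targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def cross_entropy_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy with clamping to avoid log(0)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def combined_loss(probs, targets, alpha=0.5):
    # Assumption: alpha weights the Dice term and (1 - alpha) the cross-entropy term.
    return alpha * dice_loss(probs, targets) + (1.0 - alpha) * cross_entropy_loss(probs, targets)


def deep_supervision_loss(predictions, targets, weights=(0.1, 0.1, 0.1, 0.7)):
    """Weighted sum over K confidence maps; the last entry is the final decoder output."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per confidence map")
    return sum(w * combined_loss(p, targets) for w, p in zip(weights, predictions))


targets = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]   # confident, mostly correct map
poor = [0.6, 0.4, 0.5, 0.5]   # uncertain map
print(deep_supervision_loss([poor, poor, poor, good], targets))
print(deep_supervision_loss([good, good, good, poor], targets))
```

Because the final map carries a weight of 0.7, a poor final prediction is penalized more heavily than poor auxiliary predictions, which is the dominance of the final loss that the weighting is meant to ensure.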

Datasets
In this paper, we evaluate the segmentation performance of the proposed DECTNet on four medical image segmentation tasks: skin lesion segmentation, Covid-19 lesion segmentation, polyp segmentation, and cardiac segmentation. For the skin lesion segmentation task, we utilize the ISIC2017 dataset [50], collected from leading international clinical centers and acquired with different devices. This dataset includes 2000 images for training, 150 for validation, and 600 for testing. The QaTa-COV19 dataset [51], collected by researchers from Qatar University and Tampere University, is employed for the Covid-19 lesion segmentation task. This dataset contains 9258 chest radiographs of COVID-19 lesions with manual annotation, of which 7145 images form the training dataset and 2113 images form the test dataset. We randomly select 750 samples from the training dataset as the validation dataset.
The dataset for the cardiac segmentation task is provided by the Multi-Centre, Multi-Vendor & Multi-Disease (M&Ms) Cardiac Image Segmentation Challenge [59], and was acquired at six different clinical centers using MRI scanners from four vendors. The samples are segmented by experienced clinicians from the respective institutions, including contours for the left ventricle (LV) and right ventricle (RV), as well as for the left ventricular myocardium (MYO). Because the M&Ms dataset contains 3D samples, we convert the 3D voxel samples into 2D slices, from which the training, validation, and test datasets consist of 3518, 450, and 1024 images, respectively.

Implementation details and evaluation metrics
The implementation of the proposed DECTNet is based on the public PyTorch platform and an NVIDIA GeForce RTX 3090. During the training process, we adopt stochastic gradient descent (SGD) as the optimizer, where the momentum and weight decay are set to 0.9 and 0.0001, respectively. The initial learning rate is set to 0.004, and the "poly" learning rate policy is adopted, in which the initial learning rate is multiplied by $\left(1 - \frac{iter}{total\_iter}\right)^{power}$, where $power$ is set to 0.9. In addition, the batch size and maximum epoch are set to 16 and 100, respectively. Since samples from different datasets have different scales and distributions, we first uniformly resize the input samples to 224 × 224. We then normalize the samples to zero mean and unit variance, and finally adopt random rotation and flip operations as the data augmentation strategy. In DECTNet, the total number of trainable parameters is 12.6M, with the initial channels of the convolution structure set to 32 and the initial channels of the Transformer structure set to 48. When the initial channels of the convolution structure are set to 16 and the initial channels of the Transformer structure are set to 24, the total number of trainable parameters is reduced to 3.2M.
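The "poly" learning rate policy can be written as a one-line function. The sketch below uses the initial learning rate and power stated above; the function name and the total iteration count in the example are illustrative assumptions.

```python
def poly_lr(base_lr: float, iteration: int, total_iterations: int, power: float = 0.9) -> float:
    """Learning rate under the "poly" policy: base_lr * (1 - iter / total_iter) ** power."""
    return base_lr * (1.0 - iteration / total_iterations) ** power


# Hypothetical run of 1000 iterations, sampled at a few points.
total_iterations = 1000
schedule = [poly_lr(0.004, it, total_iterations) for it in (0, 250, 500, 750, 1000)]
print(schedule)
```

The rate starts at the initial value, decreases monotonically, and reaches zero at the final iteration.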
To adequately evaluate the performance of the proposed method, we employ eight evaluation metrics across the four segmentation tasks: the Dice similarity coefficient (DSC), Jaccard Index (Jacc), Accuracy (Accu), Sensitivity (Sens), Precision (Prec), Specificity (Spec), Average symmetric surface distance (ASD), and Hausdorff distance (HD). Different subsets of these metrics are used for different segmentation tasks.
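The six overlap-based metrics can all be derived from the confusion counts of a binary prediction. The sketch below uses their standard definitions on flattened masks; it assumes that both the prediction and the ground truth contain foreground and background pixels (so that no denominator is zero) and leaves out the distance-based ASD and HD. The function names are hypothetical.

```python
def confusion_counts(pred, target):
    """Count true/false positives and negatives for flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    return tp, fp, fn, tn


def segmentation_metrics(pred, target):
    """Standard overlap metrics; assumes non-empty foreground and background."""
    tp, fp, fn, tn = confusion_counts(pred, target)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
metrics = segmentation_metrics(pred, target)
print(metrics)
```

Dice and Jaccard are monotonically related (Dice = 2J / (1 + J)), so they rank methods identically on a single sample, although their averages over a dataset can differ.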

Skin lesion segmentation
We first verify the effect of the proposed DECTNet on the skin lesion segmentation task. Dice, Jaccard, Accuracy, Sensitivity, Precision, and Specificity are adopted as the evaluation metrics in this task. Table 1 shows the quantitative results of our model and other segmentation methods. Our model achieves the best performance in mean Dice, Jaccard, Accuracy, and Sensitivity, which are 86.36%, 78.38%, 94.91%, and 84.46%, respectively. Compared to the best results produced by other methods, the results obtained from DECTNet are higher by 0.70% on Dice (85.66%, CENet), 0.96% on Jaccard (77.42%, CENet), 0.81% on Accuracy (94.10%, CENet), and 0.63% on Sensitivity (83.83%, TransFuse). Although our method does not achieve the best results in Precision and Specificity, it remains competitive with the other methods.
Fig 6 gives several visual samples and segmentation masks produced by our and other methods.The red curves in the figure are the contours of the ground truth corresponding to the samples.For convenient contrast, we integrate ground truth contours with the samples and the segmentation masks derived from the different methods.These samples show that the segmentation masks of our approach are very close to the ground truth.In contrast with the segmentation masks of other methods, our masks are a better fit for the lesions that need to be selected.

Covid-19 lesion segmentation
Then, we evaluate the performance of the proposed DECTNet on the Covid-19 lesion segmentation task. The comparison experiment for the Covid-19 segmentation task is similar to that for the skin lesion segmentation task. As shown in Table 2, our approach achieves significant gains in each evaluation metric, which demonstrates the effectiveness of the diverse modules designed in DECTNet.

Polyp segmentation
Further, we validate the segmentation performance of the proposed method on the polyp segmentation task. As described in Section Experiment, this segmentation task contains five different polyp segmentation datasets. To explicitly represent the generalization performance of the different models, the evaluation metrics of each model on each polyp segmentation dataset are displayed in Table 3. For convenience of presentation, we adopt Dice and Jaccard as the evaluation metrics. Table 3 shows that DECTNet achieves the highest mean Dice and Jaccard on four of the five datasets; on the remaining ETIS dataset, it achieves the second-best result. This indicates that, in comparison with other models, DECTNet has improved generalization performance on the polyp segmentation task.

Cardiac segmentation
Finally, we compare DECTNet with other segmentation models on the cardiac segmentation task. Unlike the previous segmentation tasks, which are based on binary classification, cardiac segmentation is a multi-class segmentation task, covering the left ventricle, right ventricle, and myocardium. Table 4 gives the quantitative comparison results for the different categories in the cardiac segmentation task. Following [59], Dice, Jaccard, HD, and ASD are adopted as the segmentation performance metrics in this task, where HD and ASD are measured in pixels. In Table 4, RV, MYO, and LV represent the right ventricle, myocardium, and left ventricle, while p refers to pixel. As shown in Fig 12, the red, green, and blue regions respectively correspond to the segmentation masks of the right ventricle, myocardium, and left ventricle, where the segmentation masks in the "Original Image" are labeled with the corresponding ground truth. Compared with the other segmentation masks, the superiority of the masks produced by our method lies in the region of the right ventricle (red). The visual samples show that for the left ventricle (blue) and myocardium (green) regions, the segmentation masks produced by the different methods differ little with respect to the ground truth. For the right ventricle (red), however, our segmentation masks have a clear advantage over the other masks, and our method distinguishes the target region more accurately.

Ablation study
To investigate the effectiveness of the proposed model and its components, we evaluate DECTNet on the skin lesion segmentation task and the Covid-19 lesion segmentation task in an ablation study. In the ablation experiments, we primarily validate the effects of four modules: the encoder module based on the convolution architecture, the encoder module based on the Transformer architecture, the decoder module with the feature integration and selection module, and the deep supervision module. Tables 5 and 6, together with the corresponding figures, show the quantitative results and visual comparisons of the ablation experiments in the two segmentation tasks. In Tables 5 and 6, the decoder in the "Dual-encoder + Sim decoder" architecture adjusts the scales of the feature maps from the different encoders with additional convolution layers. The "w/o DeepSup" structure indicates DECTNet without the deep supervision module. Note that the segmentation models containing the Swin Transformer structure are trained from scratch without pretrained parameters.
From Tables 5 and 6, it is observed that, compared to the segmentation performance of UNet, the evaluation metrics obtained by the "C-encoder + Sim decoder" model are improved, which shows the capability of the module designed in the C-encoder structure. The segmentation performance of the "T-encoder + Sim decoder" model is significantly degraded, probably because the training samples are insufficient to make the Swin Transformer structure effective on its own. Nevertheless, the "Dual-encoder + Sim decoder" structure achieves better segmentation performance than the "C-encoder + Sim decoder" and "T-encoder + Sim decoder" structures, which demonstrates that the Swin Transformer structure can capture effective representations that are difficult to obtain with convolutional structures. In addition, the improved performance of the "w/o DeepSup" and DECTNet structures demonstrates the benefits of the feature integration and selection module and of the deep supervision module, respectively.
A growing number of studies aim to improve the interpretability of segmentation methods through CAM-like approaches [60,61] in order to increase confidence in the predictions of deep learning models. Following [61], we create heat maps based on Grad-CAM [60] to visualize and interpret the segmentation results of the proposed DECTNet.
As shown in Fig 14, we take the heatmaps and segmentation masks for comparison.The (a) refers to the samples and the corresponding labels from two segmentation datasets, where the

Discussion
To achieve better accurate and robust performance in the medical image segmentation domain, we propose the DECTNet achieve this purpose.The DECTNet is a dual-encoder method based on Convolution and Swin Transformer architecture.To efficiently acquire local context information from images, we design the convolution encoder stage consisting of a Dense Connection block and a CBAM block, where the Dense Connection block delivers abundant semantic information, and the CBAM block utilizes spatial and channel attention masks to enhance semantic features that are beneficial for specific segmentation tasks.Furthermore, we employ depth-separable convolution instead of traditional convolution in the convolution encoder to reduce the model parameters.To better capture the global context dependence of medical images, we design the Swin Transformer encoder in the proposed DECTNet.Moreover, we incorporate the STP blocks in the Swin Transformer encoder for the preliminary image processing to make the Swin Transformer structure suitable for semantic segmentation tasks.Since different encoders yield different scale features, we design a feature fusion decoder to integrate and select features considered significant for the segmentation tasks.In addition, to accelerate the convergence and improve the segmentation performance of the proposed method, we add a simple but effective deep supervision module to supervise the decoder stage outputs.
To perform the performance of the proposed DECTNet, we evaluate the method on four different medical image segmentation tasks to demonstrate its effectiveness and robustness.The comparison experimental results corresponding to Tables 1-4 show that the DECTNet achieves state-of-the-art results than other segmentation models.Figs 6, 8, 10 and 12 show that our model has superior segmentation accuracy and generalization performance compared to other segmentation models.To further prove the efficacy of each component of the proposed DECTNet, we perform ablation studies on the skin lesion segmentation task and the Covid-19 lesion segmentation task.The corresponding figures and tables of the ablation experiments show that the dual encoder structure with convolution and Swin Transformer architecture improves the model's performance compared to a single structure encoder.Moreover, it is also seen that the feature fusion decoder and the deep supervision module also contribute to improving the performance of the proposed method.In conclusion, compared with other state-of-the-art models, DECTNet is effective in improving the accuracy of image segmentation and obtaining a favorable generalization of the segmentation method.
Compared to 2D segmentation networks such as DECTNet, 3D models allow direct processing of 3D data, which enables the segmentation model to capture volume information and efficiently understand more comprehensive spatial context representation.However, due to the parameter redundancy in the dual encoder structure of the 2D DECTNet, we need to reoptimize the model parameters to develop a robust and generalizable 3D DECTNet method.
In contrast to single-encoder segmentation networks, the dual-encoder structure of DECTNet increases model complexity and requires additional annotated data to mitigate the risk of overfitting. This complexity also expands the number of hyperparameters, making it more difficult and time-consuming to find the best hyperparameter configuration. Future enhancements can be approached in two ways. On the one hand, with transfer learning, complex models can be pre-trained on sufficiently large datasets to establish a more effective framework, which allows them to obtain more robust generic visual features and reduces the possibility of overfitting. On the other hand, based on the Transformer architecture, the model structure can be refined by developing a global feature extraction module with fewer parameters yet higher efficiency, which aims to enhance the efficiency and accuracy of the segmentation backbone.

Conclusion
In this paper, we propose a novel model with a dual encoder structure, named DECTNet, for medical image segmentation. We use the convolution encoder and the Swin Transformer encoder to extract local and global context information, respectively, and thereby obtain a significant hierarchical representation of medical images. Then, we design a feature fusion decoder to integrate and select the representations acquired from the convolution and Swin Transformer encoders. Further, we employ a deep supervision module to supervise the multi-scale features in the decoder stages. In experiments on four public segmentation tasks, our method demonstrates significant advantages over state-of-the-art methods.

Fig 1 illustrates the overall architecture of our proposed DECTNet. The DECTNet is based on a two-encoders-single-decoder architecture and consists of four main parts: convolution structure encoder, Swin Transformer structure encoder, feature fusion decoder, and deep supervision module.

As shown in Fig 1, the convolution-based encoder (C-encoder) of the DECTNet consists of four stages. As in the UNet encoder, each stage of the C-encoder has the same structure but operates on features at a different scale. Except for the deepest stage, there is a downsampling operation between stages, which reduces the spatial scale of the features but increases the number of channels to extract high-dimensional representations. The detailed design of a single stage is shown in Fig 2.

Fig 2 demonstrates the detailed structure of the C-encoder stage.

Fig 1. Overview of our proposed DECTNet approach with dual-encoder-single-decoder structure. DECTNet consists of four components: Convolution-based encoder, Swin Transformer-based encoder, Feature Fusion decoder, and Deep Supervision module. The detailed composition of each component is described in the following sections. https://doi.org/10.1371/journal.pone.0301019.g001

Fig 2. The detailed structure of the convolution encoder stage. This stage consists of the Dense Connection block and the CBAM block, which are applied to sufficiently extract detailed information from the images. https://doi.org/10.1371/journal.pone.0301019.g002

Fig 3. Overview of the Swin Transformer-based encoder of the proposed DECTNet. (a) Components of the Swin Transformer-based encoder, including the scaling of the features in the ST-Encoder. (b) Composition of the STP block. (c) Two successive Swin Transformer blocks. W-MSA and SW-MSA are multi-head self-attention modules with regular and shifted windowing configurations. https://doi.org/10.1371/journal.pone.0301019.g003

Fig 4. The detailed structure of the feature fusion decoder stage. The stage has two components: the feature aggregation block and the feature selection block, which are applied to integrate and select features. https://doi.org/10.1371/journal.pone.0301019.g004

Fig 5. The detailed structure of the deep supervision module. The DS-Module converts different scale features of the F-decoder into the same scale confidence maps. https://doi.org/10.1371/journal.pone.0301019.g005
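The idea of bringing decoder outputs of different scales to one common scale and supervising each of them can be sketched as follows. This is only an illustration under stated assumptions: nearest-neighbour upsampling, a binary cross-entropy loss, and equal stage weights are choices made for the example and are not confirmed details of the DS-Module; all function names are hypothetical.

```python
import math


def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2D list by an integer factor."""
    return [
        [value for value in row for _ in range(factor)]
        for row in fmap
        for _ in range(factor)
    ]


def binary_cross_entropy(prob_map, target, eps=1e-7):
    """Mean binary cross-entropy between a probability map and a 0/1 mask."""
    total, count = 0.0, 0
    for prob_row, target_row in zip(prob_map, target):
        for p, t in zip(prob_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def deep_supervision_loss(stage_maps, target, weights=None):
    """Weighted sum of per-stage losses after rescaling every map to the target size."""
    weights = weights or [1.0] * len(stage_maps)  # assumption: equal weights
    loss = 0.0
    for weight, fmap in zip(weights, stage_maps):
        factor = len(target) // len(fmap)  # assumes square maps, integer scale ratio
        loss += weight * binary_cross_entropy(upsample_nearest(fmap, factor), target)
    return loss


# Toy example: a 4x4 ground-truth mask and two decoder outputs at 2x2 and 4x4.
target = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
coarse = [[0.9, 0.1], [0.1, 0.1]]
fine = upsample_nearest(coarse, 2)
total_loss = deep_supervision_loss([coarse, fine], target)
```

Because every stage receives its own loss term, gradients reach the deeper decoder stages directly, which is the usual motivation for deep supervision.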

Fig 7 depicts the Dice score of different approaches on the validation dataset during training for the skin lesion segmentation task. For ease of observation, the figure shows the validation Dice score curves of only three segmentation models: UNet, TransUNet, and DECTNet. The curve obtained by our method is smoother and reaches a higher Dice score on the validation dataset than those of UNet and TransUNet.
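For reference, the Dice score tracked in these curves, together with the closely related Jaccard index, can be computed for binary masks as in the minimal sketch below. The function name, the plain-list mask representation, and the smoothing constant `eps` are assumptions made for illustration; the evaluation code used in the paper is not shown in the text.

```python
def dice_and_jaccard(pred, target, eps=1e-7):
    """Dice score and Jaccard index for two equally sized 2D binary masks (lists of 0/1)."""
    p = [value for row in pred for value in row]
    t = [value for row in target for value in row]
    if len(p) != len(t):
        raise ValueError("pred and target must have the same number of pixels")
    intersection = sum(a * b for a, b in zip(p, t))
    pred_sum, target_sum = sum(p), sum(t)
    union = pred_sum + target_sum - intersection
    # eps keeps both scores defined (and equal to 1) when both masks are empty.
    dice = (2 * intersection + eps) / (pred_sum + target_sum + eps)
    jaccard = (intersection + eps) / (union + eps)
    return dice, jaccard


# Toy masks: 2 overlapping foreground pixels, 3 foreground pixels in each mask.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
dice, jaccard = dice_and_jaccard(pred, target)
print(round(dice, 4), round(jaccard, 4))
```

The two metrics are monotonically related (Dice = 2J / (1 + J)), so they rank a pair of models identically on a single image, although their averages over a dataset can differ.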

Fig 6. Visual comparison examples with other approaches in the skin lesion segmentation. The red contour refers to the ground truth, and different segmentation masks are produced by different methods. https://doi.org/10.1371/journal.pone.0301019.g006

Fig 8. Visual comparison examples with other approaches in the Covid-19 lesion segmentation. The red contour refers to the ground truth, and different segmentation masks are produced by different methods. https://doi.org/10.1371/journal.pone.0301019.g008

Fig 10 visualizes some of the segmentation masks that different models produce. Compared to its counterparts, our method outlines the lesion regions more accurately and eliminates background noise. Even for complicated samples, our model still generates a precise segmentation mask. In addition, Fig 11 shows the Dice score curves of different models on the polyp segmentation validation dataset. The Dice curve of DECTNet has a considerable advantage over those of the other two models, which indicates that our proposed model has better generalization capability in the polyp segmentation task.

Fig 10. Visual comparison examples with other approaches in the polyp segmentation. The red contour refers to the ground truth, and different segmentation masks are produced by different methods. https://doi.org/10.1371/journal.pone.0301019.g010

In Fig 13, we also validate the Dice scores of different segmentation models on the validation dataset during training. Unlike in the previous three segmentation tasks, Fig 13 reports the mean Dice score over the three categories (RV, MYO, LV). The Dice score of our model on the validation dataset still slightly outperforms those of the other segmentation models.
, and Fig 14 respectively show the quantitative

Fig 12. Visual comparison examples with other approaches in the cardiac segmentation. The red, green, and blue portions refer to the right ventricle, the myocardium, and the left ventricle, respectively. Different segmentation masks are produced by different methods, where the masks in "Original Image" refer to the ground truth. https://doi.org/10.1371/journal.pone.0301019.g012

red regions represent labels. The (b) and (c) refer to the heatmaps and the corresponding segmentation masks resulting from the "C-encoder + Sim decoder" structure. The resulting heatmaps and segmentation masks produced by the DECTNet are represented by (d) and (e). As can be obtained from Fig 14, compared with (b) and (c), (d) and (e) are better suited to the

Fig 14. Visual comparison samples in the ablation study. (a) refers to the original image and the corresponding ground truth, (b) and (c) refer to the heat map and segmentation mask produced by "C-encoder + Sim decoder", and (d) and (e) to the heat map and segmentation mask produced by the DECTNet. https://doi.org/10.1371/journal.pone.0301019.g014

Table 2 and Fig 8 respectively indicate the corresponding quantified results and visual comparisons of our model and seven other segmentation methods on the Covid-19 lesion dataset. Similar to the results on the skin lesion dataset, our DECTNet is also able to obtain the best performance in terms of the mean Dice, Jaccard, Accuracy, and Precision metrics in the Covid-19 dataset. Compared to UNet in Table

Table 1. Results of comparison with other methods in the skin lesion segmentation task.

Table 2. Results of comparison with other methods in the Covid-19 lesion segmentation task.
Values are reported as mean (standard deviation). The best results are in bold and the second-best results are underlined.

Table 3. Results of comparison with other methods in the polyp segmentation task.
Values are reported as mean (standard deviation). The best results are in bold and the second-best results are underlined. https://doi.org/10.1371/journal.pone.0301019.t003

Table 4 indicates that our model achieves the best results in all evaluation metrics for LV segmentation and obtains the highest Dice and Jaccard scores, with comparable surface-related performance, in RV and MYO segmentation compared with other methods. This demonstrates that our proposed DECTNet is effective for the cardiac segmentation task. Fig 12 also gives a few samples for visual comparison. Different from visual comparison examples in preceding segmentation tasks, in Fig

Table 4. Results of comparison with other methods in the cardiac segmentation task.
Values are reported as mean (standard deviation). The best results are in bold and the second-best results are underlined.

Table 5. Quantitative results of the ablation study in the skin lesion segmentation task.
Values are reported as mean (standard deviation). The best results are in bold and the second-best results are underlined.

Table 6. Quantitative results of the ablation study in the Covid-19 lesion segmentation task.
Values are reported as mean (standard deviation). The best results are in bold and the second-best results are underlined. https://doi.org/10.1371/journal.pone.0301019.t006